This analysis, completed as part of Udacity's Data Analyst Nanodegree, shows that insights can be discovered and business questions answered by cleaning, aggregating, and visualizing tabular data and the relationships between its fields. The dataset comes from The Movie Database and contains information on more than 10,000 movies, including genre, director, cast, revenue, budget, and other relevant variables.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
movie_df = pd.read_csv("https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv")
movie_df.info()
movie_df.head()
This data has several columns related to movie revenue, voter ratings, and cast and crew. For cast, however, all of the actors are stored in a single field, separated by a pipe character (|). To analyze the lead actor, the first name in the cast column must be extracted and placed in its own column.
def get_lead(cast_str):
    """
    Get lead actor from a cast list.
    """
    cast_list = cast_str.split("|")
    return cast_list[0]
# Take an explicit copy so that adding a column below does not raise
# SettingWithCopyWarning; this avoids silencing all warnings globally.
clean_df = movie_df.dropna(subset=["cast"]).copy()
clean_df.info()
clean_df['lead_actor'] = clean_df['cast'].apply(get_lead)
clean_df.head()
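The extraction step described above can also be written without a helper function. The following is a minimal sketch, assuming a toy DataFrame (`toy_df`, with made-up actor names) in the same pipe-delimited format as the cast column. It uses the pandas string accessor instead of `apply`.

```python
import pandas as pd

# Hypothetical rows in the same pipe-delimited format as the `cast` column.
toy_df = pd.DataFrame(
    {"cast": ["Actor A|Actor B|Actor C", "Actor D", None, "Actor E|Actor F"]}
)

# Drop rows without cast information, then keep the text before the first pipe.
toy_clean = toy_df.dropna(subset=["cast"]).copy()
toy_clean["lead_actor"] = toy_clean["cast"].str.split("|").str[0]

print(toy_clean["lead_actor"].tolist())
```

Both approaches should give the same column. The string accessor version is mainly a matter of style and keeps the whole step in one expression.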
results = clean_df.groupby(['lead_actor'])['revenue_adj'].sum()
results_df = pd.DataFrame(results)
results_df.sort_values(by='revenue_adj', ascending=False).head()
# Total revenue per (release year, lead actor) pair.
pivot_tbl = pd.pivot_table(
    clean_df,
    values=['revenue_adj', 'revenue'],
    index=['release_year', 'lead_actor'],
    aggfunc='sum'
)
non_zero_tbl = pivot_tbl.loc[pivot_tbl['revenue_adj'] != 0, :]
max_values = non_zero_tbl.groupby(['release_year'])['revenue_adj'].max()
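max_records = []
# Series.iteritems() was removed in pandas 2.0; items() is the supported equivalent.
for year, value in max_values.items():
    # Keep the row(s) whose summed adjusted revenue equals that year's maximum.
    max_records.append(pivot_tbl.loc[pivot_tbl['revenue_adj'] == value, ['revenue_adj']])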
yearly_actors = pd.concat(max_records).sort_values('release_year').reset_index(level=1)
yearly_actors.head()
yearly_actors['revenue_adj'].plot(kind='line')
plt.title('Top Lead Actor Adjusted Revenue Over Time')
plt.xlabel('Year')
plt.ylabel('Adjusted Revenue USD')
ax = plt.gca()
ax.yaxis.grid(linestyle='--')
The lead actor who brought in the most revenue overall is Tom Cruise, and several actors were the top-grossing lead for multiple years, including Mark Hamill, Harrison Ford, and Johnny Depp. There is also a slight upward trend over time in the revenue brought in by each year's top lead actor, although the yearly maximum also appears to rise and fall in a periodic cycle over the same period.
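The two observations above, repeat top earners and an upward trend, can be checked numerically rather than read off the plot. The sketch below assumes a small stand-in table (`toy_yearly`) shaped like `yearly_actors`. Its actor names and revenue figures are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for yearly_actors: one top lead actor per release year.
# The names and revenue figures are made up for illustration only.
toy_yearly = pd.DataFrame(
    {
        "lead_actor": ["Actor A", "Actor B", "Actor A", "Actor C", "Actor A"],
        "revenue_adj": [1.0e8, 1.5e8, 1.2e8, 2.0e8, 2.4e8],
    },
    index=pd.Index([2000, 2001, 2002, 2003, 2004], name="release_year"),
)

# Number of years in which each actor was the top earner.
years_on_top = toy_yearly["lead_actor"].value_counts()

# Slope of a least-squares line through (year, revenue);
# a positive slope indicates an upward trend.
slope, intercept = np.polyfit(
    toy_yearly.index.to_numpy(dtype=float),
    toy_yearly["revenue_adj"].to_numpy(),
    deg=1,
)

print(years_on_top)
print(f"Trend: {slope:,.0f} adjusted USD per year")
```

Running the same two steps on `yearly_actors` would give a count of years at the top for each actor and a single number for the overall trend. A straight-line fit would not capture the apparent periodic cycle.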
# "number" selects every numeric column regardless of the platform's default integer width.
numerical_df = movie_df.select_dtypes(include="number")
compare_df = numerical_df.drop(columns=['id', 'release_year'])
compare_df.head()
compare_df.plot(x='vote_average', y='budget', kind='scatter', alpha=0.5);
plt.title("Budget Over Vote Average")
plt.xlabel("Vote Average (score)")
plt.ylabel("Budget (USD)");
compare_df.plot(x='vote_average', y='revenue_adj', kind='scatter', alpha=0.5);
plt.title("Adjusted Revenue Over Vote Average")
plt.xlabel("Vote Average (score)")
plt.ylabel("Adjusted Revenue (USD)");
compare_df.plot(x='vote_average', y='popularity', kind='scatter', alpha=0.5);
plt.title("Popularity Over Vote Average")
plt.xlabel("Vote Average (score)")
plt.ylabel("Popularity Rating");
Budget, adjusted revenue, and popularity rating each appear to have a positive linear correlation with vote average. However, the trend is less visible at smaller values, where vote averages are spread more widely. This is a topic for further research, since it suggests that these variables could potentially be predicted using vote scores in combination with other variables.
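The visual impression from the scatter plots can be backed by a correlation coefficient. The following is a minimal sketch, assuming a toy table (`toy_compare`) with invented values in place of `compare_df`. It computes the Pearson correlation of each column with vote average using `DataFrame.corr`.

```python
import pandas as pd

# Hypothetical values standing in for a few rows of compare_df; not real TMDb data.
toy_compare = pd.DataFrame(
    {
        "vote_average": [5.0, 5.5, 6.0, 6.5, 7.0, 7.5],
        "budget": [1.0e7, 3.0e7, 2.0e7, 6.0e7, 5.0e7, 9.0e7],
        "popularity": [0.4, 0.9, 0.7, 1.8, 1.5, 3.0],
    }
)

# Pearson correlation of every other column with vote_average.
vote_corr = toy_compare.corr(method="pearson")["vote_average"].drop("vote_average")

print(vote_corr.round(2))
```

Values near 1 would support a strong positive linear relationship, while values near 0 would suggest that the pattern seen in the plots is weak. The same call on `compare_df` would show which of the three variables tracks vote average most closely.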
This analysis shows that data cleaning, aggregation, and visualization can reveal and clearly communicate answers to business questions, along with trends that were not expected. These results can then be used to make informed decisions or to ask deeper questions involving inferential statistics or machine learning.